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ABSTRACT 

We  developed  a  dataflow  framework  which  provides  a  basis  for  rigorously  defining  strate¬ 
gies  to  make  use  of  runtime  preprocessing  methods  for  distributed  memory  multiprocessors. 

In  many  programs,  several  loops  access  the  same  off-processor  memory  locations.  Our 
runtime  support  gives  us  a  mechanism  for  tracking  and  reusing  copies  of  off-processor  data. 
A  key  aspect  of  our  compiler  analysis  strategy  is  to  determine  when  it  is  safe  to  reuse  copies  of 
off-processor  data.  Another  crucial  function  of  the  compiler  analysis  is  to  identify  situations 
which  allow  runtime  preprocessing  overheads  to  be  amortized.  This  dataflow  analysis  will 
make  it  possible  to  effectively  use  the  results  of  interprocedural  analysis  in  our  efforts  to 
reduce  interprocessor  communication  and  the  need  for  runtime  preprocessing. 


*  Research  was  supported  by  the  National  Aeronautics  and  Space  Administration  under  NASA  Contr2tct 
No.  NASl-18605  whUe  the  authors  were  in  residence  at  the  Institute  for  Computer  Applications  in  Science 
and  Engineering  (ICASE),  NASA  Langley  Research  Center,  Hampton,  VA  23665. 
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1  Introduction 


We  present  a  dataflow  framework  that  can  be  employed  to  systematically  make  use  of  runtime 
preprocessing  methods  aimed  at  loops  in  which  some  array  references  are  made  through  a 
level  of  indirection.  The  dataflow  framework  we  present  here  pertains  to  collections  of 
loops  with  no  loop-carried  data  dependences  or  with  only  accumulation  type  dependencies. 
Such  loops  are  often  referred  to  as  data-parallel  loops,  and  are  the  primary  target  of  the 
Fortran  D  compiler  [4].  The  type  of  irregular  loops  we  are  trying  to  handle  are  typically 
found  in  unstructured  mesh  explicit  and  multigrid  solvers,  molecular  dynamics  codes,  and 
some  sparse  iterative  linear  systems  solvers. 

In  distributed  memory  machines,  large  data  arrays  need  to  be  partitioned  between  lo¬ 
cal  memories  of  processors.  These  partitioned  data  arrays  are  called  distributed  arrays. 
Long  term  storage  of  distributed  array  data  is  assigned  to  specific  memory  locations  in  the 
distributed  machine.  In  many  irregular  problems,  we  can  reduce  the  amount  of  data  to  be 
communicated  by  using  a  partitioning  algorithm  that  individually  assigns  each  array  element 
to  a  specific  processor.  Furthermore,  these  machines  often  have  a  non-trivial  communications 
latency  or  startup  cost.  Therefore,  eflSciency  demands  that  information  to  be  transmitted 
should  be  collected  into  relatively  large  messages;  this  in  turn  implies  that  the  elements  to 
be  sent  and  received  by  each  processor  should  be  precomputed.  In  irregular  problems,  the 
conununications  pattern  depends  on  the  input  data,  typically  because  of  some  indirection 
in  the  code.  In  this  case,  it  is  not  possible  to  predict  at  compile  time  what  data  must  be 
prefetched. 

We  treat  this  lack  of  information  by  transforming  the  original  parallel  loop  into  two 
constructs  called  an  inspector  and  executor  [6,  7].  During  program  execution,  the  inspector 
examines  the  data  references  made  by  a  processor,  and  calculates  what  off-processor  data 
needs  to  be  fetched  and  where  that  data  will  be  stored  once  it  is  received.  The  executor  loop 
then  uses  the  information  from  the  inspector  to  implement  the  actual  computation.  The 
Fortran  D  compiler  now  under  development  at  Rice  University  performs  these  tasks  by  calls 
to  the  PARTI  library  built  at  ICASE. 

The  PARTI  primitives  (Parallel  Automated  Runtime  Toolkit  at  ICASE)  were  designed 
to  ease  the  implementation  of  irregular  computational  problems  on  parallel  architecture 
machines  by  relieving  the  user  or  compiler  writer  of  having  to  deal  with  many  low-level 
issues.  These  procedures 

1 .  Coordinate  interprocessor  data  movement, 

2.  Manage  the  storage  of  and  access  to  copies  of  off-processor  data,  and 

3.  Support  a  shared  name  space  by  building  a  distributed  translation  table  [9]  to  store 
the  local  address  and  processor  number  for  eaM:h  distributed  array  element. 

This  functionality  can  be  used  directly  to  generate  inspector /executor  pairs.  Each  inspector 
produces  a  communications  schedule,  which  is  essentially  a  pattern  of  communication  for 
gathering  or  scattering  data.  Hash  tables  are  used  to  avoid  repeatedly  communicating  the 
same  array  elements.  The  executor  has  embedded  PARTI  primitives  to  gather  or  scatter  data. 
The  primitives  are  designed  so  that  the  final  parallel  code  remains  as  close  as  possible  to  the 
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original  sequential  code.  The  primitives  issue  instructions  to  gather,  scatter  or  accumulate 
(e.g.  scatter  followed  by  add)  data  according  to  a  specified  schedule.  The  latency  or  start-up 
cost  is  reduced  by  packing  several  small  messages  with  the  same  destinations  into  one  large 
message.  Significant  work  has  gone  into  optimizing  the  gather,  scatter  and  accumulation 
communication  routines  for  the  iPSC/860.  It  is  not  the  purpose  of  this  paper  to  describe 
the  design  and  implementation  of  PARTI  in  great  detail;  information  on  this  can  be  found 
elsewhere  [1]. 

Our  runtime  support  makes  it  possible  to  track  and  reuse  off-processor  data  copies  [2]. 
We  generate  incremental  communications  schedules  to  obtain  only  those  off-processor  data 
not  requested  by  a  given  set  of  pre-existing  schedules.  This  gives  us  the  runtime  support  we 
need  to  combine  and  hoist  gather,  scatter  and  accumulate  procedures.  Removal  of  duplicates 
is  achieved  by  using  a  hash  table.  In  a  mesh  solver,  for  example,  the  off-processor  data  to 
be  accessed  by  the  edge  schedule  is  first  hashed  using  a  sinjple  hash  function.  Next  all  the 
data  to  be  accessed  during  the  faceJoop  is  hashed.  At  this  point  the  information  that  exists 
in  the  hash  table  allows  us  to  remove  all  the  duplicates  and  form  the  incremental  schedule. 

The  data  flow  framework  to  be  described  here  aims  at  providing  good  information  about 
what  data  are  needed  at  which  points  in  the  code,  along  with  information  about  what  live 
off-processor  data  are  available.  At  compile  time,  we  compute  global  flow  information  about 
the  communication  characteristics  of  the  loops  around  a  flow  graph.  This  framework  bears 
similarities  to  classical  techniques  such  as  common  subexpression  elimination,  loop  invariant 
code  motion,  and  dead  code  elimination.  The  framework  provides  a  basis  for  determining  at 
compile  time 

•  Where  communication  schedules  are  to  be  generated, 

•  Where  gather,  scatter,  and  accumulate  operations  are  to  be  placed,  and 

•  When  incremental  schedules  may  be  employed. 

In  our  data  flow  analysis,  some  of  the  variables  reflect  inherent  properties  of  the  analyzed 
program,  while  others  calculate  the  results  of  heuristics  we  employ  in  order  to  producing  the 
gather  and  scatter  operations.  Our  heuristics  aim  to 

•  Exploit  situations  where  we  can  reuse  communications  schedules,  and  to 

•  Remove  duplicate  communications  by  combining  and  hoisting  gather,  scatter  and  ac¬ 
cumulate  procedures. 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  provides  some  definitions  and 
terminology  for  the  framework.  Section  3  introduces  the  local  flow  variables,  followed  by 
global  variables  in  Section  4  and  result  variables  in  Section  5.  Section  6  gives  an  extension 
of  the  framework  for  handling  reduction  operations.  In  Section  7,  we  work  through  the  data 
flow  variables  for  a  program  example,  which  is  a  simplified  version  of  a  mesh  solver  for  which 
Section  8  gives  concrete  experimental  results  illustrating  the  effect  of  exploiting  data  flow 
information.  Section  9  contains  concluding  remarks. 
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2  Basics  of  the  Framework 


This  section  describes  the  scope  of  the  framework  developed  in  this  paper  and  defines  some 
concepts  used  in  later  sections. 

2.1  The  domain 

Even  though  our  implementation  can  handle  other  cases  as  well,  we  assume  here  for  presenta¬ 
tion  purposes  that  all  indirect  references  in  the  program  text  are  of  '.lie  form  {arTay){{indtxjarray){{looj. 
For  many  programs,  this  can  actually  be  achieved  by  forward  substituting  array  indices.  For 
example,  the  code  sequence  j=ia(i) ;  x(j)=10  would  be  treated  eis  x(ia(i))=10.  Arrays 
which  are  never  referenced  indirectly  are  assumed  to  be  analyzed  using  other  methods  [3] 
prior  to  this  analysis.  References  with  multiple  (but  bounded)  levels  of  indirection  will  re¬ 
quire  more  levels  of  complexity  in  the  dataflow  framework;  we  do  not  consider  potentially 
unbounded  indirection,  as  is  found  in  linked  lists. 

Let  V  be  the  set  of  arrays  which  are  accessed  indirectly.  We  assume  that  each  reference  r 
to  some  t>  €  V  is  contained  in  some  loop(s).  Let  L  be  the  set  of  loops  which  directly  enclose 
an  occurrence  of  some  v  €  V.  We  assume  that  no  I  G  L  encloses  any  other  m  G  L.  One  set 
of  data  flow  variables  is  computed  for  each  element  of  a  set  of  nodes,  N.  It  is  N  =  L\J  P, 
where  P  contains  one  entry  pad,  lentry,  and  one  exit  pad,  lexHt  for  each  loop  I  ^  L  containing 
some  V  G  L.  Furthermore,  we  assume  I  entry  (lexit)  to  be  executed  before  (after)  /  iff  /  has  at 
least  one  iteration.  The  framework  operates  on  a  loop  flow  graph  G  =  {N,  E)  of  the  program, 
where  the  edges  E  are  simple  control  flow  edges. 

For  example,  if  /  is  an  outer  time  stepping  loop  which  does  not  directly  contain  any 
irregular  array  references  but  which  contains  a  loop  V  over  mesh  edges,  then  /'  is  represented 
as  a  node  in  G  and  /  is  represented  as  some  interval  in  G.  In  the  following,  loop  refers  to 
elements  of  N,  t.e.,  it  may  denote  a  pad  as  well. 

Future  work  will  present  a  complete  framework  in  which  summary  information  is  built  in 
a  bottom-up  fashion  similar  to  array  kill  information  [3].  Finally,  this  paper  only  discusses 
the  case  where  the  summarized  loops  have  no  data  dependences,  except  for  commutative 
and  associative  reductions  which  are  handled  specially. 

2.2  Array  portions 

Array  portions  are  a  central  concept  to  the  framework  and  best  introduced  by  an  example. 

A  portion  x(ia(l:n))  consists  of  the  array  x  and  the  index  set  ia(l:n).  This  index  set  in 
turn  consists  of  the  index  array  ia  and  the  range  (l:n),  which  heis  the  lower  bound  1  and 
the  upper  bound  n. 

Several  portions  may  be  taken  from  the  same  array  or  may  have  the  same  index  set. 

The  index  range  does  not  have  to  be  known  at  compile  time,  so  the  bounds  may  contain 
symbolics.  No  assumptions  are  made  about  whether  different  portions  taken  from  the  same 
array  are  disjoint  or  whether  they  overlap  each  other  partially  or  completely.  This  allows 
analyzing  symbolic  index  ranges,  but  it  requires  the  analysis  to  be  conservative  when  using 
intersection  and  set  subtraction  in  the  equations. 
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The  framework  can  be  implemented  using  bit  vectors,  each  bit  representing  one  array 
portion.  The  length  of  these  bit  vectors  is  bounded  by  the  number  of  indirect  array  references 
(i.e.,  it  is  linear  in  program  size),  and  all  equations  given  here  are  rapid  [5].  Therefore, 
using  bit  vectors  for  the  analysis  gives  us  good  asymptotic  running  times.  However,  for 
our  examples  (and  probably  also  in  a  practical  implementation),  it  seems  advantageous  to 
represent  the  different  flow  variables  as  bit  matrices.  The  rows  of  a  bit  matrix  correspond  to 
the  arrays  of  the  portions  represented  {e.g.  x  in  x(ia(l:n))),  while  the  columns  correspond 
to  the  index  sets  (ia(l:n)).  Theoretically  that  representation  increases  variable  sizes  from 
linear  in  program  size  to  quadratic  in  program  size,  so  the  feasibility  of  this  approach  depends 
on  how  programs  behave  in  practice.  However,  this  representation  makes  potential  schedule 
sharing,  for  example,  very  easy  to  recognize  by  determining  which  index  set  columns  have 
more  than  one  entry. 

We  assume  that  all  indirect  array  references  are  identified  in  a  previous  pass  over  the 
program  text  and  construct  bit  vectors/ matrices  accordingly.  For  the  analysis  we  also  assume 
that  a  (identity)  dummy  index  array  is  inserted  for  all  direct  array  references. 

2.3  Operations  on  portions 

To  aid  the  distinction  between  portions,  indirect  array  references,  array  elements,  and  sets 
of  all  these  constructs,  we  make  a  short  digression  to  introduce  the  conversion  operators 
elements-of  p  (where  p  is  some  portion  or  set  of  portions),  denoted  p,  and  references-of  p, 
denoted  p.  Assume  we  are  given 

•  an  array  x; 

•  an  index  array  ta(l  :  5); 

•  portions  p  —  x(ia(pi  :  Pu)),  9  =  x(ia(qi :  9„)),  r  =  x(ta(n  :  r^));  and 

•  sets  of  portions  A  =  {p,g},  B  =  {p},  C  =  {r}. 

We  can  re<ison  about  A,  B,  and  C  at  different  levels.  For  example,  if  the  index  ranges  of  the 
portions  are  only  known  symbolically,  one  can  determine  at  the  portion  level  that 

ADB 

must  hold,  but  no  other  relationships  can  be  proven  among  the  sets  of  portions.  However,  if 
we  know  for  example  that 

Pi  —  —  3, 9/  ~  3, 9u  5,  Tj  3,  Tu  —  4, 

then  the  elements-of  operator,  ”,  can  be  applied  to  the  portions  and  to  the  sets  thereof,  to 
obtain 

A  =  x(ia(l  :  5)),B  =  x(ta(l  :  3)),C  =  x(ia(3  :  4)). 

The  scope  of  '  is  extended  to  set  operators  and  predicates,  so  we  can  assert  at  the  element 
level  that 

ADB, 

ADC. 
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Assume  furthermore  that  we  know  the  values  of  the  index  array  to  be 


ia(l:5)  =  l,4,3,l,4. 

Then  the  references-of  operator,  obtains 

^  =  l(l,3,4),  S  =  »(l,3,4),&  =  x(l,3). 

With  this  knowledge,  we  conclude  at  the  reference  level  that 

ADBhc. 

We  can  see  how  the  set  relationship  predicates  change  over  the  different  levels  of  reaisoning, 
with 

XDY=>  XDY  =>  X 

Another  interesting  operation  in  this  context  is  set  subtraction: 

•  A\B  =  {p,q}\{p}  =  {?},  which  is  x(l,3,4); 

•  A\B  =  A\B  =  x(ia{l  :  5))\x(ta(l  :  3))  =  a:(ta(4, 5)),  which  is  x(l,4); 

•  a\b=  A  \  ^  =  x(1,3,4)\x(1,3,4)  =  0. 

As  described  in  Section  5,  A\B  (and  the  corresponding  sets  at  lower  levels)  can  be  viewed 
as  a  so  called  incremental  schedule,  which  indicates  what  has  to  be  communicated  if  A  is 
needed  and  B  is  already  available  in  local  memory.  We  can  see  immediately  the  consequences 
for  this  incremental  schedule  in  the  example:  the  more  we  know  about  portions,  the  less  we 
might  have  to  communicate.  Formally, 

X\Y  D  X\Y  2  x\y. 

To  aid  formulating  conservative  equations  which  still  offer  the  possibility  to  exploit  any 
knowledge  potentially  available  at  compile  time,  we  introduce  some  set  operators  which  map 
sets  of  portions  into  sets  of  portions.  Given  some  set  of  portions  SET,  we  define 

SET*  =  {p  I  p  has  same  array  as  some  q  €  SET}, 

SET''  =  {p  I  SET  might  affect  p} 

=  {p  I  pf^SET  ^  0  cannot  be  disproven} 

C  SET, 

SET^  =  {p  I  SET  contains  p} 

=  {p  I  pCSET  can  be  proven} 

2  SET, 

SET  =  {p  I  SET  might  partially  touch  part  of  p} 

=  SET  \  SET. 
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SET*  can  be  deriv'^d  easily  from  SET  by  just  reducing  a  bit  matrix  (array  names  by 
index  sets)  to  a  bit  column  (array  names)  using  row-wise  OR.  From  there  we  can  conserva¬ 
tively  approximate  SET^^  SET^,  and  SET^  directly,  or  we  can  employ  further  compile  time 
knowledge  about  how  portions  relate  to  each  other  if  available.  Either  way,  we  do  not  leave 
the  portion  space  as  given  in  the  program,  t.e.,  we  can  still  represent  these  sets  with  binary 
bit  matrices. 

For  example,  let  the  portions  p,  q,  r  be  defined  as  above,  and  let  D  =  {q}.  Assuming  no 
compile  time  knowledge  at  the  element  or  reference  level,  we  can  conservatively  assume  that 
D*  =  {p,q,r},  =  {p,  9,  r},  W  =  {9},  and  EJP  =  {p,  r).  With  knowledge  at  the  element 

level,  we  have  D*  =  {p, 9, r},  EH  =  {p,9,r},  [P  =  {9, r},  and  ET  =  {p}.  Reference  level 
knowledge  gives  EE  =  {p,  9,  r},  IP  =  {p,9,r},  IP  =  {p,  9,  r},  and  =  0. 

A  point  to  keep  in  mind  when  reasoning  about  which  elements  are  contained  in  which 
portions  and  how  portions  relate  to  each  other  is  that  two  portions  p,  9  might  globally 
contain  the  same  set  of  array  elements  of  some  array  X,  but  that  locally  a  given  processor 
sees  different  parts  of  X  for  p  and  9.  (This  applies  to  Ihs  occurrences  as  well,  since  we  apply 
the  owner  computes  rule  based  on  index  array  ownerships,  not  on  data  array  ownerships; 
otherwise  we  would  not  need  a  SCATTER  operation).  In  this  case  there  has  communication 
to  occur  if  for  example  we  first  define  p  and  then  use  9.  The  important  consequence  is  that 
we  must  apply  "  and  *  based  on  the  share  of  each  processor. 

Furthermore,  we  have  to  keep  different  distributions  of  arrays  and  index  arrays  in  mind 
for  the  analysis.  For  example,  we  cannot  reuse  a  schedule  between  two  portions  which  have 
the  same  index  set,  but  whose  arrays  are  distributed  differently.  For  sake  of  simplicity, 
however,  we  assume  in  this  paper  that  all  arrays  are  conformable. 

3  The  Local  Flow  Variables 

We  define  the  local  flow  variables  to  be  the  components  of  the  data  flow  equations  which  are 
determined  by  local  analysis  of  each  loop.  In  the  following, 

•  /  stands  for  an  arbitrary  loop, 

•  p  denotes  a  portion  x(ia(lb:ub)), 

•  an  occurrence  of  p  is  either  a  use  of  p  or  a  definition  of  p,  and 

•  variable  or  flow  variable  stand  for  data  flow  variables. 

We  begin  with  two  variables,  REF  and  DEF,  which  are  familiar  from  standard  live 
variable  analysis.  A  point  to  keep  in  mind,  however,  is  that  here  live  does  not  refer  to 
whole  arrays,  but  to  limited  portions  thereof  instead.  Also,  there  may  be  conditionals  in  the 
loops  generating  the  variables,  which  can  be  handled  by  annotating  portions  with  (symbolic) 
guards  applying  to  whole  portions  or  elements  thereof. 

For  each  loop  /,  we  define 

REF(l):  the  portions  live  on  entry  to  /,  and 
DEF(l):  the  portions  defined  in  /. 
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Formally; 


REF{1)  =  {p  I  first  stmt  containing  p  in  /  reads  p}, 

DEF{1)  =  {p  I  some  stmt  in  /  assigns  to  p}. 

To  aid  the  extension  to  reduction  statements  discussed  in  Section  6,  we  do  not  base  the 
further  development  of  the  framework  on  REF  and  DEF  directly,  but  replace  them  with 
GET  and  PUT.  These  variables  are  used  to  derive  the  portions  which  have  to  be  buflPered 
locally.  We  define 

GET(l);  the  portions  referenced  in  /  from  local  memory  (the  buffer), 

PUT(l):  the  portions  written  by  I  into  the  buffer,  and 

BUF(l):  the  portions  which  will  be  buffered  on  exit  from  1. 

The  equations  (which  will  be  redefined  in  Section  6): 

GET{1)  =  REF{1), 

PUT{1)  =  DEF(l), 

BUF{1)  =  GET{1)  U  PUT{1). 

We  also  have  to  compute  the  live  ranges  of  index  sets,  otherwise  we  might  accidentally 
try  to  communicate  a  portion  before  or  after  the  program  region  where  the  index  set  of  that 
portion  is  available  (i.e.,  before  the  index  set  is  defined  or  after  it  is  overwritten  with  other 
values).  We  define 

IND(l):  the  portions  whose  index  sets  may  be  computed  (in  part)  by  /. 

KILL(l):  the  portions  that  may  be  made  invalid  by  /,  either  because  /  assigns  an  overlapping 
part  of  the  array  or  /  reassigns  the  index  set.  GATHER  operations  can  never  be  hoisted 
above  /  for  these  portions. 


FLUSH(l):  the  portions  that  may  be  read  by  /  or  whose  index  sets  may  be  reassigned  by 
1.  SCATTER  operations  can  never  be  delayed  until  after  I  for  these  portions. 

Formally: 


IND{1)  =  {p  I  p  has  index  set  ia(i„i,„:i,nox) 
and  /  assigns  to  ia}, 

KILL{1)  =  IND{1)  U  DEF^H), 

FLUSH{1)  =  IND{1)  U  REr{l). 


4  The  Global  Flow  Variables 

The  computation  of  the  global  flow  variables  constitutes  the  meat  of  the  data  flow  frame¬ 
work.  Here  we  actually  propagate  knowledge  about  the  communication  characteristics  of 
the  loops  around  in  the  flow  graph.  The  problems  addressed  here  have  elements  from  Com¬ 
mon  Subexpression  Elimination,  Loop  Invariant  Code  Motion,  and  Dead  Code  Elimination. 
As  already  mentioned,  all  equations  given  here  are  rapid,  so  we  can  expect  to  solve  them 
efficiently  using  simple  iterative  techniques.  All  global  variables  are  initialized  to  0. 
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4.1  Fetches 


The  strategy  for  determining  where  to  place  GATHER  operations  is  based  on  the  following 
definitions: 

the  portions  which  are  needed  in  /  or  in  all/any  of  the  following  loops. 

BUFFD(l):  the  portions  which  are  already  available  when  entering  /.  Here  we  assume  that 
buffers  are  not  flushed  unless  the  data  in  them  may  be  invalid,  because  either  the  data 
array  or  the  index  array  has  been  assigned  to; 

HOIST(l):  the  portions  for  which  a  GATHER  should  be  hoisted  ahead  of  /. 

FETCH(l):  the  portions  which  are  needed  in  /,  or  which  are  needed  in  some  later  loop  and 
can  be  hoisted  before  /. 

The  equations: 

L/KF“"(/)  =  GET{1)  U  n  {LIVE'^^\s)  \  KILL{1)), 

S^*tLCCs{l) 

LIVE^^^il)  =  GET{1)  U  U  (L/FF“’‘>'(s)  \  KILL{1)), 

<g«ucc«(/) 

BUFFD(l)  =  BUF{1)  U  f]  {BUFFD{p)  \  K1LL{1)), 

p€preds(l) 

HOISTH)  =  n  {UVE^'\p)  U  BUFFD{p)), 

p^preds{l) 

FETCH H)  =  GET{1)  U 

n  {HOIST{s)r\  FETCH {s)). 

s^suecs{l) 

At  this  point,  we  have  identified  candidate  locations  in  the  program  for  placing  GATHER’s. 
In  short,  whenever  a  portion  appears  in  a  FETCH{\)  set,  then  that  portion  can  be  GATHER’ed 
before  /  and  will  be  used  before  it  is  assigned.  The  final  placement  will  be  determined  by 
the  result  flow  variables  discussed  in  Section  5. 

Note  that  we  can  not  only  distinguish  the  variables  defined  so  far  by  whether  they  are 
local  or  global,  but  we  can  also  classify  them  into  either  reflecting  fixed  properties  inherent 
of  the  analyzed  program,  or  being  subject  to  heuristics.  Furthermore,  this  classification  can 
be  done  based  either  on  the  definition  of  the  variable,  ».e.,  how  it  is  defined  in  terms  of 
other  variables,  or  on  the  actual  values  of  the  variable.  For  example,  HOIST  is  currently 
defined  so  that  we  combine  and  hoist  up  GATHER’s  as  much  as  possible,  subject  to  the 
constraint  that  we  never  want  to  overcommunicate  (even  if  that  might  be  advantageous  in 
some  cases,  for  example  in  saving  schedules).  If  we,  for  example,  replace  the  LIVE^^^  in 
the  definition  of  HOIST  with  LIVE’"*^,  we  could  hoist  up  communication  even  further,  at 
the  expense  of  possibly  communicating  unnecessary  data.  In  other  words,  the  definition  of 
HOIST  is  a  matter  of  heuristics.,  which  is  not  the  case  for  the  other  definitions  so  far.  For 
other  variables  dependent  on  HOIST  (so  far,  FETCH  is  the  only  such  variable),  their  values 
become  a  matter  of  the  chosen  heuristics  as  well,  but  not  their  definition. 
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4.2  Stores 

The  high  level  strategy  for  determining  where  to  place  SCATTER  operations  is  relatively 
similar  to  the  one  for  placing  GATHER’s.  Note  that  we  do  not  have  to  scatter  portions 
(t.c.,  send  them  back  to  the  owner)  if  they  are  used  only  locally,  which  is  why  we  restrict 
our  attention  to  GE'P  instead  of  GET.  The  definitions: 

HiN®“y/aU(l) /HOUT*“y/»^l(l):  the  portions  touched  by  a  reference  on  any/all  of  the 
paths  starting  at  the  entry/exit  of  1. 

DELAY(l):  the  portions  which  should  be  scattered  in  a  later  loop,  or  which  are  dead  on 
exit. 

STORE(l):  portions  which  are  assigned  to  in  /,  or  which  were  assigned  to  earlier  and  whose 
scatter’s  can  be  hoisted  into  1. 


=  GEr{l)  U  H0UT'^“{1), 

H0UT^“{1)  =  n  HIN‘^“{s), 

aSauccs(l) 

=  GETil)  U  //Ol/r“’‘«(s), 

HOUr^m)  =  (J 

s&3ucca{l) 

DELAY (1)  =  f]  {HOUT'^‘^{s)uWUr^)\ 


U  FLUSH{s), 

3^aucca(l) 

ST0RE{1)  =  PUT{l)U 

fl  {DELAY{p)r)  STORE  ip)). 

pepre(i3(l) 


Our  heuristic,  here  defined  by  DELAY.,  is  to  combine  and  delay  SCATTER’s  as  much  as 
possible,  subject  to  the  constraint  that  we  never  scatter  data  which  are  dead. 


5  The  Result  Flow  Variables 

The  rtsult  flow  variables  given  in  this  section  are  computed  after  solving  the  equations  given 
so  far.  They  should  accurately  describe  which  portions  have  to  be  gathered  before  entering 
I  or  scattered  after  leaving  I  (possibly  using  reductions).  Here  we  want  to  take  previous  and 
succeeding  loops  and  their  communication  requirements  into  account  as  well. 

5.1  Fetches 

Similarly  to  FETCH,  GATH{\)  describes  which  portions  have  to  be  in  local  memory  before 
entering  1.  However,  it  excludes  portions  which  must  already  be  locally  available  either 
by  previous  gathers  or  by  previous  calculations.  Furthermore,  we  may  not  only  exclude 
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these  available  data  on  a  portion  by  portion  basis,  but  also  on  an  element  by  element  basis. 
In  other  words,  if  we  know  that  a  portion  x(ia(im,„: i,„ax))  i?  buffered,  then  we  might 
not  only  eliminate  gathers  of  exactly  that  portion,  but  we  can  also  save  on  a  gather  of  a 
potentially  overlapping  portion  x(ia(jm«n- jmaa:))  by  gathering  only  the  increment  from  the 
first  portion  to  the  second  one. 

For  that  purpose  we  compute  incremental  schedules  using  the  \  operator  as  introduced 

in  Section  2;  recall  that  A  \B  contains  exactly  those  references  which  appear  in  the  portions 
in  A  but  do  not  appear  in  any  of  the  portions  in  B.  Note  that  this  operator,  unlike  the 
\,U,n  used  in  the  flow  equations  so  far,  brings  us  out  of  the  fixed  space  of  sets  of  portions 
appearing  in  the  program  text,  and  applying  it  repeatedly  can  lead  to  an  explosion  of  the 
number  portions  we  have  to  be  able  to  represent  (nestings  of  increments  of  intersections  of 
increments,  etc.).  Applying  this  operator  just  once,  however,  leads  to  sets  which  can  still 
be  represented  by  3- valued  “bit”  vectors/matrices;  in  addition  to  included/not  included,  we 
also  need  explicitly  excluded. 

Note  also  that  A  \B  =  0  is  possible  even  for  A  \  B  ^  This  reflects  for  example  the 
case  where  we  express  a  mesh  and  its  boundary  as  different  portions  of  the  same  array;  the 
portions  are  distinct,  but  one  contains  a  subset  of  the  other. 

The  equation: 

GATH{1)  =  FETCHil)  \ 

fl  {FETCH{p)U  BUFFDip)). 

p^pTeds(l) 


5.2  Stores 

The  SC  ATT  variables  are  derived  from  the  STORE  variables,  except  that  we  eliminate 
unnecessary  scatters  by  excluding  portions  which  either  will  be  scattered  later,  or  which  are 

not  at  least  potentially  live  (using  HOUT’^’^^).  Again,  we  use  the  set  operator  \  to  support 
incremental  schedules. 

SCATT{1)  =  ST0RE{1)  \ 

n  {STORE{s)UHOUr^). 

9^SUCCS{1) 

Note  that  we  can  still  override  the  communication  patterns  obtained  by  global  analysis 
for  GATH  and  SCATT  by  just  substituting  the  local  counterparts  GET  and  PUT  for  them. 
Furthermore,  this  can  be  done  for  either  both  variables  or  just  one  of  them,  since  they  do 
not  rely  on  each  other,  but  merely  on  the  loop  properties. 

5.3  Schedules 

The  framework  described  so  far  gives  an  accurate  description  of  which  schedules  are  needed 
where.  Critical  for  the  overall  cost  associated  with  our  communications  is  also  the  generation 
of  these  schedules,  in  particular  where  the  schedules  are  generated.  However,  once  we  know 
the  communication  requirements,  schedule  computation  placement  appears  to  be  relatively 
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straightforward.  Therefore,  we  currently  use  the  simple  heuristic  of  generating  schedules  as 
soon  as  possible,  t.e.,  as  soon  as  the  necessary  index  arrays  are  available.  This  seems  to  work 
well  in  the  codes  we  have  considered  so  far. 


6  Reduction  Variables 

As  indicated  earlier,  the  framework  developed  so  far  can  be  extended  to  take  advantage  of 
reduction  statements  as  well.  The  portions  exclusively  appearing  in  reduction  statements 
can  be  treated  differently  from  other  defs  and  uses,  since  they  are  not  necessarily  brought 
into  local  memory  if  we  use  reduction  operations  like  SCATTERADD  or  SCATTERMULT. 
However,  portions  appearing  in  different  reduction  operations  within  one  loop  have  to  be 
brought  into  local  memory,  so  we  have  to  carefully  separate  the  portions  into  the  ones  used 
exclusively  in  ADD  reductions  and  the  ones  used  only  in  MULT  reductions: 

ADD{1)  =  {p  I  all  <7  G  are  only  added  to  in  /}, 

MULT{1)  =  {p  I  all  9  €  p^  are  only  multiplied  to  in  /}. 

We  derive  RED,  the  set  of  all  portions  which  are  used  exclusively  in  reduction  operations, 
and  redefine  GET  and  PUT  which  were  introduced  in  Section  3: 

RED{1)  =  ADD(l)  U  MULT{1), 

GET{1)  =  REF{1)  \  RED{1), 

PUT{1)  =  DEF{1)  \  RED{1). 

The  changes  so  far  have  eliminated  the  GATHER’s  and  SCATTER’s  for  portions  which 
appear  exclusively  in  reductions. 

We  now  define  another,  separate  framework,  which  computes  only  the  SCATTER_ADD’s 
(similarly  for  the  other  reductions).  This  ADD  framework  coexists  with  the  non-reduction 
framework  which  is  still  used  to  compute  communication  requirements  for  non-reduction 
operations.  The  redefined  variables  are: 

GETadd{1)  =  REF{l)\ADD{l), 

FLUSH  add(I)  =  mD(l)UGErADD{l), 

STORE add{1)  =  ADD{1)KJ  f] 

pGpreds{l) 

{DELAY add{p)  n  STORE add{p))- 

Corresponding  to  these  new  variables,  we  can  derive  HIIV'ado^,  HOUT'add^\  DELAY  add-, 
and  SCATTadd  with  the  same  equations  as  for  the  non-reduction  framework.  SC  ATT  add 
now  indicates  where  to  place  SCATTER_ADD’s. 

Like  for  the  non-reduction  framework,  we  can  override  the  result  variable  selectively 
with  their  local  counterpart,  which  is  here  ADD.  Note  that  the  flow  equations  for  ADD  are 
defined  independently  of  other  reductions.  This  simplifies  extending  the  framework  to  other 
reduction  operations  by  just  adding  flow  variables  and  equations,  without  having  to  modify 
existing  ones  (except  extending  RED). 
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G  ATHER(z(nf  1  ( 1  ;nf  )),z(nf2(  1  ;nf ))) 
do  iii  =  1,  itime 

GATHER(y(iel(l:ne)),y(ie2(l:ne)), 
y(nfl(  1  :nf  )),y(nf2(  1  :nf ))) 
do  i  =  1,  ne 

x(iel(i))  =  x(iel(i))  +  y(ie2(i)) 
x(ie2(i))  =  x(ie2(i))  +  y(iel(i)) 
enddo 
do  j  =  1,  nf 

x(nfl(j))  =  x(nfl(j))  +  y(nf2(j))  +  z(nf2(j)) 
x(nf2(j))  =  x(nf2(j))  +  y(nfl(j))  +  z(nfl(j)) 
enddo 
do  k  =  1,  ne 

x(iel(k))  =  x(iel(k))  +  y(ie2(k)) 
x(ie2(k))  =  x(ie2(k))  +  y(iel(k)) 

enddo 

SCATTERADD(x(iel(  1  :ne)),x(ie2(  1  :ne)), 
x(nf  1  ( 1  :nf  )),x(nf2(  1  :nf ))) 
do  1  =  1,  nn 

y(i)  =  x(i) 

enddo 

enddo 

Figure  1:  Example  code,  communication  is  already  inserted  as  derived  by  the  framework. 

7  Example 

Figure  1  shows  an  example  code.  In  this  program,  we  have 

•  four  inner  loops,  /i,  /2»  hi  and  /4; 

•  one  entry  and  one  exit  pad,  Iq  and  h; 

•  three  array  names,  x,  y,  and  z; 

•  five  index  sets,  sj  =  iel(l  :  ne),  S2  =  ie2(l  :  ne),  sg  =  i/l(l  :  nf),  S4  =  i/2(l  :  nf), 
and  S5  =  identity{l  ;  nn); 

•  this  spans  a  bit  matrix  of  fifteen  portions,  X\  =  a:(si),  X2  =  x(s2)»  ■••1^5  =  z(s5),  twelve 
of  which  actually  occur  in  the  program  text. 

The  corresponding  flow  graph  is  shown  in  Figure  2. 

The  bit  matrices  of  the  resulting  local  flow  variables  are  shown  in  Figure  3.  A  matrix 
entry  for  a  particular  portion  p  and  a  flow  variable  VAR  is  defined  as  follows: 

“1”  -  p  is  included  in  VAR, 

-  p  is  not  included  in  VAR, 
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Figure  2:  Flow  graph  for  example  code. 


“0”  -  p  is  explicitly  excluded  from  VAR  {ass.  result  of  the  \  operator;  in  our  example,  there 
are  none  such  entries  due  to  the  simple  control  flow  structure). 

Figure  4  shows  the  global  and  result  variables.  Figure  5  shows  the  variables  for  the  ADD 
framework.  The  result  variables,  i.e.GATH  and  SC  ATT,  determine  where  the  GATHER 
and  SCATTER  operations  should  be  placed.  If  the  bit  representing  a  portion  is  set  in  the 
GATH  set,  then  a  GATHER  operation  for  that  portion  is  placed  at  the  beginning  of  that 
loop.  Similarly,  a  set  bit  in  the  SC  ATT  set  results  in  placement  of  a  SCATTER  operation 
(SCATTER_ADD  in  the  ADD  framework)  at  the  end  of  a  loop.  GATHER’s  and  SCATTER’s 
of  portions  with  identity  as  the  index  array  are  ignored.  This  is  valid  because  they  represent 
data  movement  from  a  processor  to  itself. 

We  do  not  show  here  the  optimizations  needed  to  generate  the  schedule  operations  (i.c., 
the  inspectors).  In  general,  the  method  is  to  identify  the  index  sets  used,  and  insert  the 
inspectors  at  the  birthpoints  of  those  sets.  The  first  step  can  be  done  by  inspection,  while 
the  second  is  a  simple  application  of  reaching  definition  analysis. 


8  Experimental  Results 

We  summarize  the  results  of  some  of  the  experiments  we  have  carried  out  to  evaluate  the  per¬ 
formance  impact  of  our  optimizations.  The  experiments  employed  an  explicit  unstructured 
mesh  solver  of  the  three  dimensional  Euler  equations  which  comprise  a  non-linear  system  of 
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Figure  3:  Local  flow  variables  for  example  code. 


five  differential  equations.  The  calculation  consists  of  a  sequence  of  loops  over  edges,  bound¬ 
ary  faces  and  nodes  of  an  unstructured  mesh.  The  code  was  originally  developed  by  Dimitri 
Mavriplis.  The  program  was  ported  to  the  Touchstone  Delta  using  Parti  primitives  [1,  2] 
and  the  code  was  run  to  simulate  a  variety  of  aircraft  configurations  under  a  range  of  test 
conditions.  While  the  port  was  carried  out  by  hand,  the  strategy  used  to  place  the  PARTI 
primitives  was  the  same  as  the  strategies  that  would  result  from  our  dataflow  framework. 
We  executed  a  number  of  different  versions  of  the  parallel  Euler  solver  to  give  an  idea  of  how 
the  code  generated  using  the  dataflow  framework  would  affect  performance. 

The  reader  should  note  that  the  simple  test  loops  presented  in  Section  7  were  for  ease 
of  exposition,  our  experimental  work  involved  a  full  unstructured  explicit  Euler  solver.  The 
example  code  shown  in  Figure  1  depicts  loops  which  are  motivated  by  the  actual  Euler  solver. 
The  loop  structure  of  the  example  code  can  be  derived  from  the  actual  solver  by  inlining  the 
function  calls.  The  main  loop  in  the  example  is  analogous  to  a  time-stepping  loop.  Arrays 
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Figure  4:  Global  flow  variables  and  result  flow  variables  for  example  code 
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Figure  5:  Flow  variables  for  ADD  framework  of  example  code. 


in  the  actual  Euler  solver  required  GATHER’ing  and  SCATTER’ing  at  different  levels,  as 
illustrated  by  the  y  and  z  arrays  in  the  example.  Unlike  our  example,  the  arrays  in  the  actual 
code  are  multidimensional;  incorporating  this  into  our  framework  only  requires  treating  the 
additional  dimension  as  a  regularly-accessed  array.  (This  dimension  has  a  size  of  five,  and 
all  elements  in  that  dimension  are  set  together.)  There  are  three  different  kinds  of  loops 
present:  two  over  the  edges  (/i  and  /a),  one  over  the  boundary  faces  (/2)  and  one  over  the 
nodes  (/4).  These  are  typical  of  the  variety  of  loop  types  found  in  the  actual  code.  In 
summary.  Figure  1  abstracts  many  of  the  complex  parts  of  the  actual  code  (such  as  control 
flow  and  irregular  computations),  while  ignoring  the  straightforward  parts  (such  as  short 
vectors  in  other  dimensions). 

The  test  case  we  report  here  involves  the  computation  of  a  highly  resolved  flow  over  a 
three-dimensional  aircraft  configuration.  The  mesh  contained  a  total  of  804,056  points  and 
approximately  4.5  million  tetrahedra.  We  believe  this  to  be  the  largest  unstructured  grid 
Euler  solution  attempted  to  date.  In  Figure  6,  we  depict  one  of  the  meshes  used  in  our 
experimentation  (we  do  not  show  the  804K  mesh  due  to  printing  and  resolution  limitations). 
For  this  case,  the  freestream  Mach  number  is  0.768  and  the  incidence  is  1.16  degrees.  We 
employed  the  recursive  spectral  partitioning  algorithm  to  carry  out  partitioning  [8,  10]. 
Partitioning  was  performed  on  a  sequential  machine  as  a  preprocessing  operation. 

We  present  timings  that  result  from  four  different  implementations  of  the  Euler  solver.  In 
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Figure  6:  Coarse  Unstructured  Mesh  about  an  Aircraft  Configuration  with  Single  Nacelle; 
Number  of  Points  =  106,064,  Number  of  Tetrahedra  =  575,986. 

two  cases,  we  employ  incremental  scheduling  and  aggressively  combine  and  hoist  gather  and 
accumulate  procedure  calls.  In  the  two  other  cases,  we  do  not  make  use  of  the  knowledge  we 
would  obtain  through  global  dataflow  analysis  in  placement  of  gather,  scatter  and  accumulate 
procedure  calls.  In  these  versions  we  place  gather  procedures  immediately  before  loops  that 
contain  irregular  array  references.  The  use  of  incremental  scheduling  leads  to  a  large  amount 
of  live  data  reuse.  For  instance,  one  step  of  Runge  Kutta  integration  in  the  experimental 
code  uses  the  flow  variables  in  a  sequence  of  three  loops  over  edges  followed  by  a  loop  over 
boundary  faces.  The  flow  variables  are  only  updated  at  the  end  of  the  entire  integration, 
rather  than  after  each  loop.  We  can  obtain  all  of  the  off-processor  flow  variables  needed  at 
the  beginning  of  the  step. 

The  indirection  arrays  (in  Figure  1  the  arrays  iel,  ie2,  ifl,  if 2),  with  which  we 
form  our  schedules,  usually  do  not  get  written  into  in  non-adaptive  calculations.  In  such 
situations  one  can  form  the  communication  schedules  before  any  actual  computation  begins, 
i.e.,  in  the  preprocessing  stage.  In  our  experimental  code,  the  edges  and  the  boundary  faces 
of  the  mesh  are  fixed  throughout  the  computation.  Hence,  we  can  very  easily  move  the 
formation  of  the  schedule  outside  the  main  iteration  loop,  as  the  indirection  arrays  are  live 
throughout  loop  body.  The  only  way  the  generation  of  the  schedules  (i.e.,  the  inspectors) 
can  be  moved  outside  a  block  of  code  is  if  it  can  be  ascertained  that  the  indirection  array, 
with  which  the  schedule  is  being  formed,  is  live  inside  the  whole  block.  This  can  only  be 
determined  through  global  dataflow  analysis. 

We  present  timings  that  result  from  generating  schedules  as  soon  as  the  necessary  in¬ 
dex  arrays  are  generated  and  timings  that  result  from  generating  a  new  schedule  for  each 


Scheduling  Method 

Time  per 
Iteration 
(seconds) 

Perfor¬ 

mance 

(Mflops) 

Preprocessing 

Time 

(seconds) 

Not  incremental, 
inside  of  main  loop 

6.91 

573 

273 

Not  incremental, 
outside  of  main  loop 

4.18 

947 

2.73 

Incremental, 
inside  of  main  loop 

5.64 

702 

299 

Incremental, 
outside  of  main  loop 

2.65 

1496 

2.99 

Table  1:  Explicit  Unstructured  Euler  Solver  on  804K  Mesh  on  512  Delta  Processors 

invocation  of  each  gather,  scatter  or  accumulate  procedure  call. 

Table  1  depicts: 

•  The  time  required  per  iteration, 

•  The  computational  rate  in  Mflops,  and 

•  The  preprocessing  time  needed  per  iteration  for  generating  all  communication  sched¬ 
ules. 

Comparing  the  times  for  scheduling  inside  the  loop  and  scheduling  only  once  in  the  compu¬ 
tation,  we  see  performance  improvements  ranging  from  65%  to  over  100%  The  preprocessing 
time  increases  only  modestly  when  we  use  incremental  scheduling  and  is  roughly  equal  to 
the  cost  of  a  single  parallelized  iteration.  Once  we  have  hoisted  schedule  generation  outside 
the  main  iteration  loop,  use  of  incremental  scheduling  leads  to  an  additional  58%  reduction 
in  total  time. 


9  Conclusions 

Communicating  the  right  data  at  the  right  time  and  place  is  a  difficult,  yet  crucial  task 
for  parallelizing  irregular  problems.  The  PARTI  primitives  are  valuable  tools  for  the  first 
part  of  the  problem,  namely  for  determining  where  to  find  which  data  and  for  efficient 
data  exchange.  The  dataflow  framework  presented  in  this  paper  is  designed  for  attacking 
the  second  part  of  the  problem,  namely  enabling  the  compiler  to  make  good  use  of  these 
primitives  without  further  advice  by  the  user.  We  believe  our  approach  to  be  effective  for  a 
wide  range  of  interesting  problems,  as  illustrated  for  an  explicit  unstructured  mesh  solver. 
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